Chinese Word Segmentation Based on Contextual Entropy
Authors
Abstract
Chinese is written without word delimiters, so word segmentation is generally considered a key step in processing Chinese texts. This paper presents a new statistical approach that segments Chinese character sequences into words based on the contextual entropy on both sides of a bigram. Contextual entropy is used to capture the dependency on the left and right contexts in which a bigram occurs. The approach segments text by finding word boundaries rather than the words themselves. Experimental results show that it is effective for Chinese word segmentation.
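The abstract does not give the formula for contextual entropy, so the following is only a minimal sketch under a stated assumption: that the left (or right) contextual entropy of a bigram is the Shannon entropy of the distribution of characters observed immediately before (or after) it in a corpus. The function names, the boundary markers `^` and `$`, and the toy corpus are all hypothetical, and the sketch does not reproduce how the paper turns these values into boundary decisions.

```python
import math
from collections import Counter, defaultdict


def _entropy(counts):
    """Shannon entropy (in bits) of a Counter of context characters."""
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def contextual_entropy(corpus, n=2):
    """Map each character n-gram to (left_entropy, right_entropy).

    Assumption: '^' and '$' mark sentence boundaries and do not occur
    in the text itself.
    """
    left = defaultdict(Counter)   # n-gram -> counts of preceding characters
    right = defaultdict(Counter)  # n-gram -> counts of following characters
    for sentence in corpus:
        padded = "^" + sentence + "$"
        for i in range(1, len(padded) - n):
            gram = padded[i:i + n]
            left[gram][padded[i - 1]] += 1
            right[gram][padded[i + n]] += 1
    return {g: (_entropy(left[g]), _entropy(right[g])) for g in left}


# Toy corpus (hypothetical): the bigram "喜欢" appears in varied contexts.
corpus = ["我喜欢北京", "他喜欢上海", "我喜欢音乐"]
entropies = contextual_entropy(corpus)
print(entropies["喜欢"])
```

Under this assumed definition, a bigram whose neighbors vary widely on both sides gets high entropy on both sides, which is the kind of signal a boundary-finding method could exploit; a bigram always seen in the same context gets entropy zero.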
Similar resources
Chinese Word Segmentation Based On Direct Maximum Entropy Model
Chinese word segmentation is a fundamental and important issue in Chinese information processing. In order to find a unified approach for Chinese word segmentation, the author develops a Chinese lexical analyzer, PCWS, using a direct maximum entropy model. The paper presents a general description of PCWS, as well as the results and analysis of its performance at the Second International Chinese Wor...
A Maximum Entropy Chinese Character-Based Parser
The paper presents a maximum entropy Chinese character-based parser trained on the Chinese Treebank (“CTB” henceforth). Word-based parse trees in CTB are first converted into character-based trees, where word-level part-of-speech (POS) tags become constituent labels and character-level tags are derived from word-level POS tags. A maximum entropy parser is then trained on the character-based corpu...
Combination of Machine Learning Methods for Optimum Chinese Word Segmentation
This article presents our recent work for participation in the Second International Chinese Word Segmentation Bakeoff. Our system performs two procedures: out-of-vocabulary extraction and word segmentation. We compose three out-of-vocabulary extraction modules: character-based tagging with different classifiers – maximum entropy, support vector machines, and conditional random fields. We also co...
Chinese Word Boundaries Detection Based on Maximum Entropy Model
Chinese texts are written continuously with ideographic characters. Unlike texts in Western languages such as English and Portuguese, they use no delimiters to mark word boundaries. Hence, for any Chinese information processing system such as automatic question and answering, web information retrieval, text-to-speech conversion, mach...
Chinese Word Segmentation Based on an Approach of Maximum Entropy Modeling
In this paper, we describe our Chinese word segmentation system for the 3rd SIGHAN Chinese Language Processing Bakeoff Word Segmentation Task. Our system deals with the Chinese character sequence using a Maximum Entropy model, which is generated fully automatically from the training data by analyzing the character sequences in the training corpus. We analyze its performance on both close...
Journal title:
Volume, issue:
Pages: -
Publication date: 2003